Using Twitter we’ll try to find out how people feel about Christmas.
Aims:
- Show how to access APIs and how to text-mine the resulting data
- Train and use a glmnet model for predicting the sentiment of tweets
- Use leaflet to visualise the data
See Paul’s previous talk for a good 101 on text mining: https://github.com/RUMgroup/Text-mining
See Reka’s previous talk on leaflet: https://github.com/RUMgroup/leaflet_tutorial
Load (and install if missing) the libraries needed for this work
packages <- c("twitteR","streamR","ROAuth","DT","glmnet","text2vec","maps",
              "leaflet","rgdal","raster","maptools","RColorBrewer")
p <- sapply(packages, function(x) {
  if (!require(x, character.only = TRUE))
    install.packages(x)
  library(x, character.only = TRUE)
})
## Loading required package: twitteR
## Loading required package: streamR
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Loading required package: ROAuth
## Loading required package: DT
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
## Loading required package: text2vec
## Loading required package: maps
## Loading required package: leaflet
## Loading required package: rgdal
## Loading required package: sp
## rgdal: version: 1.2-16, (SVN revision 701)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.10.1, released 2013/08/26
## Path to GDAL shared files: /usr/share/gdal/1.10
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
## Linking to sp version: 1.2-4
## Loading required package: raster
## Loading required package: maptools
## Checking rgeos availability: FALSE
## Note: when rgeos is not available, polygon geometry computations in maptools depend on gpclib,
## which has a restricted licence. It is disabled by default;
## to enable gpclib, type gpclibPermit()
## Loading required package: RColorBrewer
We’ll use the Twitter API to access Twitter data in an R-friendly way.
We first need to connect to the Twitter API, which requires an account and a registered app to generate API keys and access tokens. Create an app here:
#need a dev twitter account - can make your own easily
API_Key <- ""
API_Secret <- ""
Access_Token <- ""
Access_Secret <- ""
#authenticate in headless mode
setup_twitter_oauth(API_Key, API_Secret,Access_Token,Access_Secret)
#Let's search past tweets
#has a tendency to be rate limited - wait 15 mins between each big search - takes a couple of mins to complete
tweetsTrump <-searchTwitter(searchString="@realDonaldTrump",n=100,lang="en")
saveRDS(tweetsTrump,file = "data/tweetsTrump.RDS")
searchTwitter returns a list of status objects containing the tweet text and meta information. Reference classes look quite odd in R, more like Java/Python: methods are called from the object itself, and the objects are mutable.
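As an aside, here is a minimal sketch of reference-class behaviour, using a made-up Counter class (not part of twitteR):

```r
# define a simple reference class with one field and one method
Counter <- setRefClass("Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function() {
      count <<- count + 1   # modifies the object in place
    }
  ))

c1 <- Counter$new(count = 0)
c1$increment()   # method called from the object, like Java/Python
c1$count         # now 1 - the object was mutated, no reassignment needed
```

This is the same method-call syntax (`object$method()`) we use on the status objects below.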
tweetsTrump <- readRDS("data/tweetsTrump.RDS")
#look at the class of the first element in the returned list
class(tweetsTrump[[1]])
## [1] "status"
## attr(,"package")
## [1] "twitteR"
#take a peek at the class structure
str(tweetsTrump[[1]])
## Reference class 'status' [package "twitteR"] with 17 fields
## $ text : chr "@realDonaldTrump \"Not\" funny old school reference"
## $ favorited : logi FALSE
## $ favoriteCount: num 0
## $ replyToSN : chr "realDonaldTrump"
## $ created : POSIXct[1:1], format: "2017-12-01 09:40:26"
## $ truncated : logi FALSE
## $ replyToSID : chr "924649059520073730"
## $ id : chr "936530418324475904"
## $ replyToUID : chr "25073877"
## $ statusSource : chr "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
## $ screenName : chr "matthew_darden"
## $ retweetCount : num 0
## $ isRetweet : logi FALSE
## $ retweeted : logi FALSE
## $ longitude : chr(0)
## $ latitude : chr(0)
## $ urls :'data.frame': 0 obs. of 4 variables:
## ..$ url : chr(0)
## ..$ expanded_url: chr(0)
## ..$ dispaly_url : chr(0)
## ..$ indices : num(0)
## and 53 methods, of which 39 are possibly relevant:
## getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet,
## getLatitude, getLongitude, getReplyToSID, getReplyToSN, getReplyToUID,
## getRetweetCount, getRetweeted, getRetweeters, getRetweets,
## getScreenName, getStatusSource, getText, getTruncated, getUrls,
## initialize, setCreated, setFavoriteCount, setFavorited, setId,
## setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN,
## setReplyToUID, setRetweetCount, setRetweeted, setScreenName,
## setStatusSource, setText, setTruncated, setUrls, toDataFrame,
## toDataFrame#twitterObj
#get the screen name
tweetsTrump[[1]]$getScreenName()
## [1] "matthew_darden"
#how many times has it been retweeted?
tweetsTrump[[1]]$getRetweetCount()
## [1] 0
#Get the text
tweetsTrump[[1]]$getText()
## [1] "@realDonaldTrump \"Not\" funny old school reference"
tweetsTrumpText <- sapply(tweetsTrump,function(x) x$getText())
tweetsTrumpText <-iconv(tweetsTrumpText, "latin1", "ASCII", "")
datatable(as.data.frame(tweetsTrumpText),rownames = F)
We are limited in how far back we can search and in how much data we can gather.
Let’s listen to the live stream instead. This requires authenticating with ROAuth, which will open a webpage and give you a PIN to enter in R.
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
my_oauth <- OAuthFactory$new(consumerKey=API_Key,
                             consumerSecret=API_Secret,
                             requestURL=requestURL,
                             accessURL=accessURL,
                             authURL=authURL)
#should open up browser and give you a pin to type into R
my_oauth$handshake()
#Listen to all English-language Christmas tweets and store them as JSON
filterStream(file.name="tweets_keyword", track=c("Christmas","Xmas"),tweets=200000,oauth=my_oauth,language="en")
tweets.Christmas <- parseTweets("tweets_keyword", verbose = TRUE)
tweets.Christmas <- tweets.Christmas [!duplicated(tweets.Christmas$text),]
saveRDS(tweets.Christmas,file="data/tweets.Christmas.RDS")
Look at the tweets
tweets.Christmas <- readRDS("data/tweets.Christmas.RDS")
#make sure the text is in ASCII
tweets.Christmas$text <-iconv(tweets.Christmas$text, "latin1", "ASCII", "")
#datatable(tweets.Christmas[1:100,],rownames = F)
We wish to rate the tweets on happiness. Doing this by hand would be too time-consuming, so instead we use a classification model, trained on 1.6 million tweets, to predict the sentiment.
See makeSentimentModel.R for details of how the model can be created - it takes ~40 mins on a single CPU.
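makeSentimentModel.R is not reproduced here, but a model like this could be built along the following lines - a sketch, assuming a labelled data frame `train` with a `text` column and a binary `sentiment` column:

```r
library(text2vec)
library(glmnet)

# tokenize the labelled tweets
it_train <- itoken(train$text,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer)

# build a vocabulary-based vectorizer and a document-term matrix
vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)

# weight by tf-idf, matching the transformation applied to new tweets
dtm_train_tfidf <- fit_transform(dtm_train, TfIdf$new())

# logistic regression with lasso penalty, tuned by cross-validation
sentimentModel <- cv.glmnet(x = dtm_train_tfidf, y = train$sentiment,
                            family = "binomial", type.measure = "auc")

# save the model and the vectorizer for reuse
saveRDS(sentimentModel, "data/glmnet_classifier.RDS")
saveRDS(vectorizer, "data/vectorizer.RDS")
```

The vectorizer must be saved along with the model so that new tweets are mapped onto exactly the same vocabulary columns at prediction time.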
#load the model
sentimentModel <- readRDS("data/glmnet_classifier.RDS")
#load the vectoriser function
vectorizer <- readRDS("data/vectorizer.RDS")
# preprocessing and tokenization
it_tweets <- itoken(tweets.Christmas$text,
preprocessor = tolower,
tokenizer = word_tokenizer,
progressbar = TRUE)
# creating vocabulary and document-term matrix
dtm_tweets <- create_dtm(it_tweets, vectorizer)
## Warning in cpp_vocabulary_corpus_create(vocabulary$term, attr(vocabulary, :
## '.Random.seed' is not an integer vector but of type 'NULL', so ignored
# transforming data with tf-idf
dtm_tweets_tfidf <- fit_transform(dtm_tweets, TfIdf$new())
# predict probabilities of positiveness
preds_tweets <- predict(sentimentModel, dtm_tweets_tfidf, type = 'response')[ ,1]
# adding rates to initial dataset
tweets.Christmas$sentiment <- preds_tweets
tweets.Christmas.filt <- tweets.Christmas[order(tweets.Christmas$sentiment),]
tweets.Christmas.filt <- tweets.Christmas.filt[c(1:100,(nrow(tweets.Christmas.filt)-100):nrow(tweets.Christmas.filt)),]
tweets.Christmas.filt$text <-iconv(tweets.Christmas.filt$text, "UTF-8", "ISO-8859-1", "")
#look at the text and the sentiment
datatable(tweets.Christmas.filt[,c("text","sentiment")],rownames = F)
boxplot(tweets.Christmas$sentiment)
#get longitude and latitude for tweets from the location data
data(world.cities)
#modified function from - http://biostat.jhsph.edu/~jleek/code/twitterMap.R
findLatLon <- function(loc){
  latlon = NA
  cont = NA

  # Asia = 1, Africa = 2, North America = 3, South America = 4, Australia/New Zealand = 5, Europe = 6
  continents = matrix(NA,nrow=length(unique(world.cities[,2])),ncol=2)
  continents[,1] = unique(world.cities[,2])
  continents[1:10,2] = c(1,1,1,2,1,1,1,1,1,1)
  continents[11:20,2]= c(1,1,2,1,1,2,1,2,2,2)
  continents[21:30,2] = c(2,1,6,6,6,6,6,6,6,6)
  continents[31:40,2] = c(6,6,6,6,2,4,4,1,2,1)
  continents[41:50,2] = c(4,6,1,4,6,1,3,1,6,6)
  continents[51:60,2] = c(3,2,4,2,6,1,6,1,3,2)
  continents[61:70,2] = c(1,2,2,2,3,6,3,3,6,6)
  continents[71:80,2] = c(1,1,2,6,3,4,3,4,6,1)
  continents[81:90,2] = c(3,3,3,2,2,6,6,6,6,4)
  continents[91:100,2] = c(2,5,2,2,3,1,1,1,1,1)
  continents[101:110,2] = c(1,2,1,1,1,3,2,5,1,6)
  continents[111:120,2] = c(1,6,1,1,2,6,1,1,6,2)
  continents[121:130,2] = c(6,6,6,1,1,3,4,3,4,2)
  continents[131:140,2] = c(6,6,2,2,1,1,1,4,1,1)
  continents[141:150,2] = c(1,2,2,1,1,1,4,6,6,2)
  continents[151:160,2] = c(4,1,1,1,1,2,4,6,2,2)
  continents[161:170,2] = c(1,2,2,1,6,2,1,1,6,1)
  continents[171:180,2] = c(1,1,1,2,6,2,2,6,1,1)
  continents[181:190,2] = c(2,6,2,1,6,6,3,3,3,3)
  continents[191:200,2] = c(2,2,2,2,3,2,3,2,3,1)
  continents[201:210,2] = c(3,2,2,2,2,2,2,1,6,2)
  continents[211:220,2] = c(1,3,1,6,2,4,3,6,3,4)
  continents[221:230,2] = c(1,1,1,3,2,3,3,6,1,6)
  continents[231:232,2] = c(2,1)

  # Get the first element of the location
  # firstElement = strsplit(loc,"[^[:alnum:]]")[[1]][1]
  firstElement = strsplit(loc,",")[[1]][1]
  if(is.na(firstElement)){firstElement="zzzzzzzzz"}

  # See if it is a city
  tmp = grep(firstElement,world.cities[,1],fixed=TRUE)
  tmp2 = grep(firstElement,state.name,fixed=TRUE)
  tmp3 = grep(firstElement,world.cities[,2],fixed=TRUE)

  if(length(tmp) == 1){
    latlon = world.cities[tmp,c(5,4)]
    cont = continents[which(world.cities[tmp,2]==continents[,1]),2]
  }else if(length(tmp) > 1){
    tmpCities = world.cities[tmp,]
    latlon = tmpCities[which.max(tmpCities$pop),c(5,4)]
    cont = continents[which(tmpCities[which.max(tmpCities$pop),2]==continents[,1]),2]
  }else if(length(tmp2) == 1){
    latlon = c(state.center$x[tmp2],state.center$y[tmp2])
    cont = 3
  }else if(length(tmp3) > 0){
    tmpCities = world.cities[tmp3,]
    latlon = tmpCities[which.max(tmpCities$pop),c(5,4)]
    cont = continents[which(tmpCities[which.max(tmpCities$pop),2]==continents[,1]),2]
  }

  #return(list(latlon=latlon,cont=as.numeric(cont)))
  return(latlon)
}
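A quick sanity check of the helper. The location strings below are made up for illustration; results depend on the world.cities data, and ambiguous names resolve to the most populous match:

```r
# several cities are called Manchester; the largest (Manchester, UK) wins
findLatLon("Manchester, UK")

# no city, state or country matches, so the function returns NA
findLatLon("xyzzy12345")
```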
tweets.Christmas$location <-iconv(tweets.Christmas$location, "latin1", "ASCII", "")
locs<-as.data.frame(tweets.Christmas$location)
locs_lat <-apply(locs,1,findLatLon)
saveRDS(locs_lat,"data/loc_lat.RDS")
locs_lat <- readRDS("data/loc_lat.RDS")
tweets.Christmas$longitude <- unlist(lapply(locs_lat,"[",1))
tweets.Christmas$latitude <- unlist(lapply(locs_lat,"[",2))
tweets.Christmas<- tweets.Christmas[ !is.na(tweets.Christmas$longitude),]
m <- leaflet(tweets.Christmas[1:500,]) %>%
addProviderTiles("CartoDB.Positron") %>%
addMarkers(lng=~longitude, lat=~latitude)
m
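As an optional extra (not in the original code), each marker can carry its tweet text and predicted sentiment as a popup, so clicking a point on the map shows what was said:

```r
# same map, but clicking a marker shows the tweet and its sentiment score
m <- leaflet(tweets.Christmas[1:500,]) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng = ~longitude, lat = ~latitude,
             popup = ~paste0(text, " (sentiment: ", round(sentiment, 2), ")"))
m
```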
Colour each area by mean sentiment
#get a shape file
regions <- getData('GADM', country='GB', level=2)
#find the intersect of the lat-long with the shapefile polygons
p <- SpatialPointsDataFrame(coords = tweets.Christmas[,c("longitude","latitude")],
                            data = data.frame(ID = paste0("tweet",1:nrow(tweets.Christmas)),
                                              sentiment = tweets.Christmas$sentiment))
proj4string(p)<-CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
regions <- spTransform(regions, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"))
#intersect
res <- over(regions, p,returnList = T)
#mean sentiment per polygon
res <- lapply(res,function(x) {
if(nrow(x)<1){
return(0.5)
} else{
mean(x[,"sentiment"])
}})
#assign the mean sentiment to the polygon
regions$sentiment <-unlist(res)
#Which places tweet the saddest Christmas tweets? (lowest mean sentiment first)
head(regions[ order(regions$sentiment,decreasing = F),])
## OBJECTID ID_0 ISO NAME_0 ID_1 NAME_1 ID_2
## 113 113 242 GBR United Kingdom 2 Northern Ireland 113
## 129 129 242 GBR United Kingdom 2 Northern Ireland 129
## 122 122 242 GBR United Kingdom 2 Northern Ireland 122
## 135 135 242 GBR United Kingdom 2 Northern Ireland 135
## 149 149 242 GBR United Kingdom 3 Scotland 149
## 152 152 242 GBR United Kingdom 3 Scotland 152
## NAME_2 HASC_2 CCN_2 CCA_2 TYPE_2 ENGTYPE_2
## 113 Antrim GB.AN NA District District
## 129 Larne GB.LR NA District District
## 122 Coleraine GB.CL NA District District
## 135 Newtownabbey GB.NW NA District District
## 149 East Renfrewshire GB.ER NA Unitary District Unitary District
## 152 Falkirk GB.FK NA Unitary District Unitary District
## NL_NAME_2 VARNAME_2 sentiment
## 113 0.1486539
## 129 0.3972214
## 122 0.4834109
## 135 0.4837799
## 149 0.4910031
## 152 0.4933216
#make a colour palette - purple=sad,green=happy
col <- colorNumeric("PiYG",domain=c(0,1))
#map the shape file coloured by the mean sentiment
m <- leaflet(regions) %>% addProviderTiles("Stamen.Toner") %>%
addPolygons( stroke=F,color = ~col(sentiment))
m